Abstract

Summarization based on text extraction is inherently limited, but
generation-style abstractive methods have proven challenging to build.
In this work, we propose a fully data-driven approach to abstractive
sentence summarization. Our method utilizes a local attention-based
model that generates each word of the summary conditioned on the input
sentence. While the model is structurally simple, it can easily be
trained end-to-end and scales to a large amount of training data. The
model shows significant performance gains on the DUC-2004 shared task
compared with several strong baselines.

1 Introduction 
Figure 1: Example output of the attention-based summarization (ABS)
system. The heatmap represents a soft alignment between the input
(right) and the generated summary (top). The columns represent the
distribution over the input after generating each word.


Summarization is an important challenge of natural language
understanding. The aim is to produce a condensed representation of an
input text that captures the core meaning of the original. Most
successful summarization systems utilize extractive approaches that crop
out and stitch together portions of the text to produce a condensed
version. In contrast, abstractive summarization attempts to produce a
bottom-up summary, aspects of which may not appear as part of the
original.

We focus on the task of sentence-level summarization. While much work on
this task has looked at deletion-based sentence compression techniques
(Knight and Marcu (2002), among many others), studies of human
summarizers show that it is common to apply various other operations
while condensing, such as paraphrasing, generalization, and reordering
(Jing, 2002). Past work has modeled this abstractive summarization
problem either using linguistically-inspired constraints (Dorr et al.,
2003; Zajic et al., 2004) or with syntactic transformations of the input
text (Cohn and Lapata, 2008; Woodsend et al., 2010). These approaches
are described in more detail in Section 6.

We instead explore a fully data-driven approach for generating
abstractive summaries. Inspired by the recent success of neural machine
translation, we combine a neural language model with a contextual input
encoder. Our encoder is modeled off of the attention-based encoder of
Bahdanau et al. (2014) in that it learns a latent soft alignment over
the input text to help inform the summary (as shown in Figure 1).
Crucially, both the encoder and the generation model are trained jointly
on the sentence summarization task. The model is described in detail in
Section 3. Our model also incorporates a beam-search decoder as well as
additional features to model extractive elements; these aspects are
discussed in Sections 4 and 5.

This approach to summarization, which we call Attention-Based
Summarization (ABS), incorporates less linguistic structure than
comparable abstractive summarization approaches, but can easily scale to
train on a large amount of data. Since our system makes no assumptions
about the vocabulary of the generated summary, it can be trained directly
on any document-summary pair.^1 This allows us to train a summarization
model for headline generation on a corpus of article pairs from Gigaword
(Graff et al., 2003) consisting of around 4 million articles. An example
of generation is given in Figure 2, and we discuss the details of this
task in Section 7.

Input ($\mathbf{x}_1, \ldots, \mathbf{x}_{18}$). First sentence of article:

russian defense minister ivanov called sunday for the creation of a joint front for combating global terrorism

Output ($\mathbf{y}_1, \ldots, \mathbf{y}_8$). Generated headline:

russia calls for joint front against terrorism

$g(\text{terrorism}, \mathbf{x}, \text{for}, \text{joint}, \text{front}, \text{against})$

Figure 2: Example input sentence and the generated summary. The score of
generating $\mathbf{y}_{i+1}$ (terrorism) is based on the context
$\mathbf{y}_c$ (for ... against) as well as the input
$\mathbf{x}_1 \ldots \mathbf{x}_{18}$. Note that the summary generated is
abstractive, which makes it possible to generalize (russian defense
minister to russia) and paraphrase (for combating to against), in
addition to compressing (dropping the creation of); see Jing (2002) for
a survey of these editing operations.

To test the effectiveness of this approach we run extensive comparisons
with multiple abstractive and extractive baselines, including
traditional syntax-based systems, integer linear program-constrained
systems, information-retrieval style approaches, as well as statistical
phrase-based machine translation. Section 8 describes the results of
these experiments. Our approach outperforms a machine translation system
trained on the same large-scale dataset and yields a large improvement
over the highest scoring system in the DUC-2004 competition.

2 Background 

We begin by defining the sentence summarization task. Given an input
sentence, the goal is to produce a condensed summary. Let the input
consist of a sequence of $M$ words $\mathbf{x}_1, \ldots, \mathbf{x}_M$
coming from a fixed vocabulary $\mathcal{V}$ of size $|\mathcal{V}| = V$.
We will represent each word as an indicator vector
$\mathbf{x}_i \in \{0,1\}^V$ for $i \in \{1, \ldots, M\}$, sentences as a
sequence of indicators, and $\mathcal{X}$ as the set of possible inputs.
Furthermore define the notation $\mathbf{x}_{[i,j,k]}$ to indicate the
sub-sequence of elements $i, j, k$.

A summarizer takes $\mathbf{x}$ as input and outputs a shortened sentence
$\mathbf{y}$ of length $N < M$. We will assume that the words in the
summary also come from the same vocabulary $\mathcal{V}$ and that the
output is a sequence $\mathbf{y}_1, \ldots, \mathbf{y}_N$. Note that in
contrast to related tasks, like machine translation, we will assume that
the output length $N$ is fixed, and that the system knows the length of
the summary before generation.^2

^1 In contrast to large-scale sentence compression systems like
Filippova and Altun (2013), which require monotonic aligned compressions.

Next consider the problem of generating summaries. Define the set
$\mathcal{Y} \subset (\{0,1\}^V, \ldots, \{0,1\}^V)$ as all possible
sentences of length $N$, i.e. for all $i$ and
$\mathbf{y} \in \mathcal{Y}$, $\mathbf{y}_i$ is an indicator. We say a
system is abstractive if it tries to find the optimal sequence from this
set $\mathcal{Y}$,

$$\arg\max_{\mathbf{y} \in \mathcal{Y}} s(\mathbf{x}, \mathbf{y}), \tag{1}$$

under a scoring function
$s : \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}$. Contrast this to
a fully extractive sentence summary^3 which transfers words from the
input:

$$\arg\max_{m \in \{1, \ldots, M\}^N} s(\mathbf{x}, \mathbf{x}_{[m_1, \ldots, m_N]}), \tag{2}$$

or to the related problem of sentence compression that concentrates on
deleting words from the input:

$$\arg\max_{m \in \{1, \ldots, M\}^N,\; m_{i-1} < m_i} s(\mathbf{x}, \mathbf{x}_{[m_1, \ldots, m_N]}). \tag{3}$$

While abstractive summarization poses a more difficult generation
challenge, the lack of hard constraints gives the system more freedom in
generation and allows it to fit with a wider range of training data.

In this work we focus on factored scoring functions, $s$, that take into
account a fixed window of previous words:

$$s(\mathbf{x}, \mathbf{y}) \approx \sum_{i=0}^{N-1} g(\mathbf{y}_{i+1}, \mathbf{x}, \mathbf{y}_c), \tag{4}$$

where we define $\mathbf{y}_c \triangleq \mathbf{y}_{[i-C+1, \ldots, i]}$
for a window of size $C$.

^2 For the DUC-2004 evaluation, it is actually the number of bytes of
the output that is capped. More detail is given in Section 7.

^3 Unfortunately the literature is inconsistent on the formal definition
of this distinction. Some systems self-described as abstractive would be
extractive under our definition.

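In particular consider the conditional log-probability of a summary
given the input,
$s(\mathbf{x}, \mathbf{y}) = \log p(\mathbf{y} \mid \mathbf{x}; \theta)$.
We can write this as:

$$\log p(\mathbf{y} \mid \mathbf{x}; \theta) \approx \sum_{i=0}^{N-1} \log p(\mathbf{y}_{i+1} \mid \mathbf{x}, \mathbf{y}_c; \theta),$$

where we make a Markov assumption on the length of the context as size
$C$ and assume for $i < 1$, $\mathbf{y}_i$ is a special start symbol
$\langle S \rangle$.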

With this scoring function in mind, our main focus will be on modelling
the local conditional distribution:
$p(\mathbf{y}_{i+1} \mid \mathbf{x}, \mathbf{y}_c; \theta)$. The next
section defines a parameterization for this distribution; in Section 4
we return to the question of generation for factored models, and in
Section 5 we introduce a modified factored scoring function.
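
The factorization above can be made concrete with a short sketch. It is an illustration under stated assumptions, not the authors' code: the names `context_window`, `log_prob_summary`, `uniform_model`, and the `"<s>"` token are hypothetical, and the local model is passed in as an arbitrary callable standing in for $\log p(\mathbf{y}_{i+1} \mid \mathbf{x}, \mathbf{y}_c; \theta)$.

```python
import math

START = "<s>"  # hypothetical stand-in for the special start symbol


def context_window(summary, i, C):
    """Return y_c = (y_{i-C+1}, ..., y_i) with 1-based positions.

    Positions smaller than 1 are filled with the start symbol.
    """
    return tuple(summary[k - 1] if k >= 1 else START
                 for k in range(i - C + 1, i + 1))


def log_prob_summary(x, y, C, local_log_prob):
    """Sum the local terms log p(y_{i+1} | x, y_c) for i = 0 .. N-1."""
    total = 0.0
    for i in range(len(y)):
        y_c = context_window(y, i, C)
        total += local_log_prob(y[i], x, y_c)  # y[i] is y_{i+1} (0-based list)
    return total


def uniform_model(vocab_size):
    """Toy local model (an assumption): every word is equally likely."""
    return lambda word, x, y_c: -math.log(vocab_size)
```

With a real model, `local_log_prob` would be the network of Section 3; the surrounding bookkeeping (context padding and the sum over positions) would stay the same.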

3 Model 

The distribution of interest,
$p(\mathbf{y}_{i+1} \mid \mathbf{x}, \mathbf{y}_c; \theta)$, is a
conditional language model based on the input sentence $\mathbf{x}$.
Past work on summarization and compression has used a noisy-channel
approach to split and independently estimate a language model and a
conditional summarization model (Banko et al., 2000; Knight and Marcu,
2002; Daume III and Marcu, 2002), i.e.,

$$\arg\max_{\mathbf{y}} \log p(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} \log p(\mathbf{y})\, p(\mathbf{x} \mid \mathbf{y})$$

where $p(\mathbf{y})$ and $p(\mathbf{x} \mid \mathbf{y})$ are estimated
separately. Here we instead follow work in neural machine translation
and directly parameterize the original distribution as a neural network.
The network contains both a neural probabilistic language model and an
encoder which acts as a conditional summarization model.

3.1 Neural Language Model 

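The core of our parameterization is a language model for estimating the
contextual probability of the next word. The language model is adapted
from a standard feed-forward neural network language model (NNLM),
particularly the class of NNLMs described by Bengio et al. (2003). The
full model is:

$$p(\mathbf{y}_{i+1} \mid \mathbf{y}_c, \mathbf{x}; \theta) \propto \exp(\mathbf{V}\mathbf{h} + \mathbf{W}\,\mathrm{enc}(\mathbf{x}, \mathbf{y}_c)),$$
$$\tilde{\mathbf{y}}_c = [\mathbf{E}\mathbf{y}_{i-C+1}, \ldots, \mathbf{E}\mathbf{y}_i],$$
$$\mathbf{h} = \tanh(\mathbf{U}\tilde{\mathbf{y}}_c).$$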



Figure 3: (a) A network diagram for the NNLM decoder with additional
encoder element. (b) A network diagram for the attention-based encoder
$\mathrm{enc}_3$.

The parameters are $\theta = (\mathbf{E}, \mathbf{U}, \mathbf{V}, \mathbf{W})$
where $\mathbf{E} \in \mathbb{R}^{D \times V}$ is a word embedding matrix,
$\mathbf{U} \in \mathbb{R}^{(CD) \times H}$,
$\mathbf{V} \in \mathbb{R}^{V \times H}$,
$\mathbf{W} \in \mathbb{R}^{V \times H}$ are weight matrices,^4 $D$ is the
size of the word embeddings, and $\mathbf{h}$ is a hidden layer of size
$H$. The black-box function enc is a contextual encoder term that returns
a vector of size $H$ representing the input and current context; we
consider several possible variants, described subsequently. Figure 3a
gives a schematic representation of the decoder architecture.

3.2 Encoders 

Note that without the encoder term this represents a standard language
model. By incorporating in enc and training the two elements jointly we
crucially can incorporate the input text into generation. We discuss
next several possible instantiations of the encoder.

Bag-of-Words Encoder. Our most basic model simply uses the bag-of-words
of the input sentence embedded down to size $H$, while ignoring
properties of the original order or relationships between neighboring
words. We write this model as:

$$\mathrm{enc}_1(\mathbf{x}, \mathbf{y}_c) = \mathbf{p}^\top \tilde{\mathbf{x}},$$
$$\mathbf{p} = [1/M, \ldots, 1/M],$$
$$\tilde{\mathbf{x}} = [\mathbf{F}\mathbf{x}_1, \ldots, \mathbf{F}\mathbf{x}_M].$$

Here the input-side embedding matrix
$\mathbf{F} \in \mathbb{R}^{H \times V}$ is the only new parameter of the
encoder and $\mathbf{p} \in [0,1]^M$ is a uniform distribution over the
input words.

^4 Each of the weight matrices $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}$
also has a corresponding bias term. For readability, we omit these terms
throughout the paper.













For summarization this model can capture the relative importance of
words to distinguish content words from stop words or embellishments.
Potentially the model can also learn to combine words, although it is
inherently limited in representing contiguous phrases.

Convolutional Encoder. To address some of the modelling issues with
bag-of-words we also consider using a deep convolutional encoder for the
input sentence. This architecture improves on the bag-of-words model by
allowing local interactions between words while also not requiring the
context $\mathbf{y}_c$ while encoding the input.

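We utilize a standard time-delay neural network (TDNN) architecture,
alternating between temporal convolution layers and max pooling layers.

$$\forall j,\quad \mathrm{enc}_2(\mathbf{x}, \mathbf{y}_c)_j = \max_i \tilde{\mathbf{x}}^L_{i,j}, \tag{5}$$
$$\forall i,\; l \in \{1, \ldots, L\},\quad \tilde{\mathbf{x}}^l_i = \tanh(\max\{\bar{\mathbf{x}}^l_{2i-1}, \bar{\mathbf{x}}^l_{2i}\}), \tag{6}$$
$$\forall i,\; l \in \{1, \ldots, L\},\quad \bar{\mathbf{x}}^l_i = \mathbf{Q}^l \tilde{\mathbf{x}}^{l-1}_{[i-Q, \ldots, i+Q]}, \tag{7}$$
$$\tilde{\mathbf{x}}^0 = [\mathbf{F}\mathbf{x}_1, \ldots, \mathbf{F}\mathbf{x}_M]. \tag{8}$$

Here $\mathbf{F}$ is a word embedding matrix and
$\mathbf{Q}^{L \times H \times (2Q+1)}$ consists of a set of filters for
each layer $\{1, \ldots, L\}$. Eq. 7 is a temporal (1D) convolution
layer, Eq. 6 consists of a 2-element temporal max pooling layer and a
pointwise non-linearity, and the final output Eq. 5 is a max over time.
At each layer $\tilde{\mathbf{x}}$ is one half the size of
$\bar{\mathbf{x}}$. For simplicity we assume that the convolution is
padded at the boundaries, and that $M$ is greater than $2^L$ so that the
dimensions are well-defined.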

Attention-Based Encoder. While the convolutional encoder has richer
capacity than bag-of-words, it still is required to produce a single
representation for the entire input sentence. A similar issue in machine
translation inspired Bahdanau et al. (2014) to instead utilize an
attention-based contextual encoder that constructs a representation
based on the generation context. Here we note that if we exploit this
context, we can actually use a rather simple model similar to
bag-of-words:

$$\mathrm{enc}_3(\mathbf{x}, \mathbf{y}_c) = \mathbf{p}^\top \bar{\mathbf{x}},$$
$$\mathbf{p} \propto \exp(\tilde{\mathbf{x}} \mathbf{P} \tilde{\mathbf{y}}'_c),$$
$$\tilde{\mathbf{x}} = [\mathbf{F}\mathbf{x}_1, \ldots, \mathbf{F}\mathbf{x}_M],$$
$$\tilde{\mathbf{y}}'_c = [\mathbf{G}\mathbf{y}_{i-C+1}, \ldots, \mathbf{G}\mathbf{y}_i],$$
$$\forall i,\quad \bar{\mathbf{x}}_i = \sum_{q=i-Q}^{i+Q} \tilde{\mathbf{x}}_q / Q.$$

Here $\mathbf{G} \in \mathbb{R}^{D \times V}$ is an embedding of the
context, $\mathbf{P} \in \mathbb{R}^{H \times (CD)}$ is a new weight
matrix parameter mapping between the context embedding and input
embedding, and $Q$ is a smoothing window. The full model is shown in
Figure 3b.

Informally we can think of this model as simply replacing the uniform
distribution in bag-of-words with a learned soft alignment,
$\mathbf{P}$, between the input and the summary. Figure 1 shows an
example of this distribution $\mathbf{p}$ as a summary is generated. The
soft alignment is then used to weight the smoothed version of the input
$\bar{\mathbf{x}}$ when constructing the representation. For instance if
the current context aligns well with position $i$ then the words
$\mathbf{x}_{i-Q}, \ldots, \mathbf{x}_{i+Q}$ are highly weighted by the
encoder. Together with the NNLM, this model can be seen as a
stripped-down version of the attention-based neural machine translation
model.^5
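
As a rough illustration of the attention-based encoder, the sketch below computes the alignment distribution and the weighted, smoothed input for already-embedded vectors. It uses plain Python lists rather than a tensor library, and several details are assumptions rather than facts from the paper: the function names are hypothetical, the smoothing window is truncated at the sentence boundaries, and the window sum is divided by $Q$ as the formula is written (so $Q \geq 1$ is required).

```python
import math


def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_encoder(x_emb, y_ctx, P, Q):
    """x_emb: M rows of size H (the F x_i); y_ctx: concatenated context
    embedding of size C*D; P: H x (C*D) matrix; Q: smoothing window (>= 1).

    Returns (enc, p): the size-H encoding and the alignment over inputs.
    """
    M, H = len(x_emb), len(x_emb[0])
    p = softmax(matvec(x_emb, matvec(P, y_ctx)))  # p proportional to exp(x~ P y~'_c)
    x_bar = []
    for i in range(M):
        window = x_emb[max(0, i - Q): i + Q + 1]  # truncated at boundaries (assumption)
        x_bar.append([sum(row[h] for row in window) / Q for h in range(H)])
    enc = [sum(p[i] * x_bar[i][h] for i in range(M)) for h in range(H)]
    return enc, p
```

When the context embedding carries no information (all zeros), `p` is uniform and the encoder reduces to a bag-of-words average over the smoothed input, which matches the informal description above.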

3.3 Training 

The lack of generation constraints makes it possible to train the model
on arbitrary input-output pairs. Once we have defined the local
conditional model,
$p(\mathbf{y}_{i+1} \mid \mathbf{x}, \mathbf{y}_c; \theta)$, we can
estimate the parameters to minimize the negative log-likelihood of a set
of summaries. Define this training set as consisting of $J$
input-summary pairs
$(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \ldots, (\mathbf{x}^{(J)}, \mathbf{y}^{(J)})$.
The negative log-likelihood conveniently factors^6 into a term for each
token in the summary:

$$\mathrm{NLL}(\theta) = -\sum_{j=1}^{J} \log p(\mathbf{y}^{(j)} \mid \mathbf{x}^{(j)}; \theta) = -\sum_{j=1}^{J} \sum_{i=1}^{N-1} \log p(\mathbf{y}^{(j)}_{i+1} \mid \mathbf{x}^{(j)}, \mathbf{y}_c; \theta).$$

We minimize NLL by using mini-batch stochastic gradient descent. The
details are described further in Section 7.

^5 To be explicit, compared to Bahdanau et al. (2014) our model uses an
NNLM instead of a target-side LSTM, source-side windowed averaging
instead of a source-side bidirectional RNN, and a weighted dot-product
for alignment instead of an alignment MLP.

^6 This is dependent on using the gold standard contexts $\mathbf{y}_c$.
An alternative is to use the predicted context within a structured or
reinforcement-learning style objective.



4 Generating Summaries 


We now return to the problem of generating summaries. Recall from Eq. 4
that our goal is to find,

$$\mathbf{y}^* = \arg\max_{\mathbf{y} \in \mathcal{Y}} \sum_{i=0}^{N-1} g(\mathbf{y}_{i+1}, \mathbf{x}, \mathbf{y}_c).$$

Unlike phrase-based machine translation where inference is NP-hard, it
actually is tractable in theory to compute $\mathbf{y}^*$. Since there is
no explicit hard alignment constraint, Viterbi decoding can be applied
and requires $O(NV^C)$ time to find an exact solution. In practice
though $V$ is large enough to make this difficult. An alternative
approach is to approximate the arg max with a strictly greedy or
deterministic decoder.

A compromise between exact and greedy decoding is to use a beam-search
decoder (Algorithm 1) which maintains the full vocabulary $\mathcal{V}$
while limiting itself to $K$ potential hypotheses at each position of
the summary. This has been the standard approach for neural MT models
(Bahdanau et al., 2014; Sutskever et al., 2014; Luong et al., 2015). The
beam-search algorithm is shown here, modified for the feed-forward
model:

Algorithm 1 Beam Search

Input: Parameters $\theta$, beam size $K$, input $\mathbf{x}$
Output: Approx. $K$-best summaries
$\pi[0] \leftarrow \{\epsilon\}$
$\mathcal{S} = \mathcal{V}$ if abstractive else $\{\mathbf{x}_i \mid \forall i\}$
for $i = 0$ to $N - 1$ do
    > Generate Hypotheses
    $\mathcal{N} \leftarrow \{[\mathbf{y}, \mathbf{y}_{i+1}] \mid \mathbf{y} \in \pi[i], \mathbf{y}_{i+1} \in \mathcal{S}\}$
    > Hypothesis Recombination
    $\mathcal{H} \leftarrow \{\mathbf{y} \in \mathcal{N} \mid s(\mathbf{x}, \mathbf{y}) > s(\mathbf{x}, \mathbf{y}')\ \forall \mathbf{y}' \in \mathcal{N} \text{ s.t. } \mathbf{y}_c = \mathbf{y}'_c\}$
    > Filter K-Max
    $\pi[i+1] \leftarrow \text{K-arg max}_{\mathbf{y} \in \mathcal{H}}\ g(\mathbf{y}_{i+1}, \mathbf{x}, \mathbf{y}_c) + s(\mathbf{x}, \mathbf{y})$
end for
return $\pi[N]$

As with Viterbi this beam search algorithm is much simpler than beam
search for phrase-based MT. Because there is no explicit constraint that
each source word be used exactly once there is no need to maintain a bit
set and we can simply move from left to right generating words. The beam
search algorithm requires $O(KNV)$ time. From a computational
perspective though, each round of beam search is dominated by computing
$p(\mathbf{y}_i \mid \mathbf{x}, \mathbf{y}_c)$ for each of the $K$
hypotheses. These can be computed as a mini-batch, which in practice
greatly reduces the factor of $K$.

5 Extension: Extractive Tuning


While we will see that the attention-based model is effective at
generating summaries, it does miss an important aspect seen in the
human-generated references. In particular the abstractive model does not
have the capacity to find extractive word matches when necessary, for
example transferring unseen proper noun phrases from the input. Similar
issues have also been observed in neural translation models,
particularly in terms of translating rare words (Luong et al., 2015).

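To address this issue we experiment with tuning a very small set of
additional features that trade off the abstractive/extractive tendency
of the system. We do this by modifying our scoring function to directly
estimate the probability of a summary using a log-linear model, as is
standard in machine translation:

$$p(\mathbf{y} \mid \mathbf{x}; \theta, \alpha) \propto \exp\Big(\alpha^\top \sum_{i=0}^{N-1} f(\mathbf{y}_{i+1}, \mathbf{x}, \mathbf{y}_c)\Big).$$

Here $\alpha \in \mathbb{R}^5$ is a weight vector and $f$ is a feature
function. Finding the best summary under this distribution corresponds
to maximizing a factored scoring function $s$,

$$s(\mathbf{x}, \mathbf{y}) = \sum_{i=0}^{N-1} \alpha^\top f(\mathbf{y}_{i+1}, \mathbf{x}, \mathbf{y}_c),$$

where
$g(\mathbf{y}_{i+1}, \mathbf{x}, \mathbf{y}_c) \triangleq \alpha^\top f(\mathbf{y}_{i+1}, \mathbf{x}, \mathbf{y}_c)$
to satisfy Eq. 4. The function $f$ is defined to combine the local
conditional probability with some additional indicator features:

$$f(\mathbf{y}_{i+1}, \mathbf{x}, \mathbf{y}_c) = [\ \log p(\mathbf{y}_{i+1} \mid \mathbf{x}, \mathbf{y}_c; \theta),$$
$$\mathbb{1}\{\exists j.\ \mathbf{y}_{i+1} = \mathbf{x}_j\},$$
$$\mathbb{1}\{\exists j.\ \mathbf{y}_{i+1-k} = \mathbf{x}_{j-k}\ \forall k \in \{0, 1\}\},$$
$$\mathbb{1}\{\exists j.\ \mathbf{y}_{i+1-k} = \mathbf{x}_{j-k}\ \forall k \in \{0, 1, 2\}\},$$
$$\mathbb{1}\{\exists k > j.\ \mathbf{y}_i = \mathbf{x}_k,\ \mathbf{y}_{i+1} = \mathbf{x}_j\}\ ].$$

These features correspond to indicators of unigram, bigram, and trigram
match with the input as well as reordering of input words. Note that
setting $\alpha = (1, 0, \ldots, 0)$ gives a model identical to standard
ABS.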
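
To illustrate how the five components of $f$ could be computed for a single candidate word, here is a minimal sketch over tokenized word lists. The function names `features` and `local_score` are hypothetical, the log-probability is passed in as a plain number rather than computed by a model, and the handling of missing history at the start of the summary (no bigram or trigram match possible) is an assumption.

```python
def features(prefix, w, x, log_p):
    """Feature vector f(y_{i+1}, x, y_c) for candidate word w.

    prefix: words generated so far (y_1 .. y_i); x: input words;
    log_p: log p(w | x, y_c) from the neural model (supplied by the caller).
    """
    prev1 = prefix[-1] if len(prefix) >= 1 else None  # y_i
    prev2 = prefix[-2] if len(prefix) >= 2 else None  # y_{i-1}
    n = len(x)
    unigram = any(xj == w for xj in x)
    bigram = any(x[j] == w and x[j - 1] == prev1 for j in range(1, n))
    trigram = any(x[j] == w and x[j - 1] == prev1 and x[j - 2] == prev2
                  for j in range(2, n))
    # Reordering: the previous summary word occurs later in the input than w.
    reorder = any(x[j] == w and x[k] == prev1
                  for j in range(n) for k in range(j + 1, n))
    return [log_p, float(unigram), float(bigram), float(trigram), float(reorder)]


def local_score(alpha, f):
    """g(y_{i+1}, x, y_c) = alpha^T f."""
    return sum(a * b for a, b in zip(alpha, f))
```

With `alpha = [1, 0, 0, 0, 0]` the local score is just the model log-probability, which corresponds to the untuned model described above.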

After training the main neural model, we fix $\theta$ and tune the
$\alpha$ parameters. We follow the statistical machine translation setup
and use minimum-error rate training (MERT) to tune for the summarization
metric on tuning data (Och, 2003). This tuning step is also identical to
the one used for the phrase-based machine translation baseline.
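
Decoding with either the plain or the tuned score uses the beam search of Section 4. The following is a simplified sketch of that procedure, not the released implementation: `beam_search` and its arguments are hypothetical names, the local score `g` is any callable of the form `g(word, x, y_c)`, hypotheses are plain Python lists, and a context size of at least 1 is assumed.

```python
import heapq

START = "<s>"  # hypothetical start symbol used to pad short contexts


def beam_search(x, N, K, C, vocab, g):
    """Approximate the arg max over length-N summaries of sum_i g(y_{i+1}, x, y_c).

    Returns up to K (score, words) pairs, best first. Assumes C >= 1.
    """
    beam = [(0.0, [])]
    for _ in range(N):
        # Generate hypotheses: extend every beam entry with every candidate word.
        candidates = []
        for score, words in beam:
            y_c = tuple(([START] * C + words)[-C:])
            for w in vocab:
                candidates.append((score + g(w, x, y_c), words + [w]))
        # Hypothesis recombination: keep the best hypothesis per final context.
        best = {}
        for score, words in candidates:
            key = tuple(([START] * C + words)[-C:])
            if key not in best or score > best[key][0]:
                best[key] = (score, words)
        # Filter K-max.
        beam = heapq.nlargest(K, best.values(), key=lambda h: h[0])
    return beam
```

Passing the input words as `vocab` restricts the search to extractive output, and `K = 1` gives greedy decoding. Recombination is safe here because the score of any continuation depends only on the last `C` words.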



6 Related Work 

Abstractive sentence summarization has been traditionally connected to
the task of headline generation. Our work is similar to early work of
Banko et al. (2000) who developed a statistical machine
translation-inspired approach for this task using a corpus of
headline-article pairs. We extend this approach by: (1) using a neural
summarization model as opposed to a count-based noisy-channel model, (2)
training the model on much larger scale (25K compared to 4 million
articles), and (3) allowing fully abstractive decoding.

This task was standardized around the DUC-2003 and DUC-2004 competitions
(Over et al., 2007). The Topiary system (Zajic et al., 2004) performed
the best in this task, and is described in detail in the next section.
We point interested readers to the DUC web page (http://duc.nist.gov/)
for the full list of systems entered in this shared task.

More recently, Cohn and Lapata (2008) give a compression method which
allows for more arbitrary transformations. They extract tree
transduction rules from aligned, parsed texts and learn weights on
transformations using a max-margin learning algorithm. Woodsend et al.
(2010) propose a quasi-synchronous grammar approach utilizing both
context-free parses and dependency parses to produce legible summaries.
Both of these approaches differ from ours in that they directly use the
syntax of the input/output sentences. The latter system is W&L in our
results; we attempted to train the former system T3 on this dataset but
could not train it at scale.

In addition to Banko et al. (2000) there has been some work using
statistical machine translation directly for abstractive summarization.
Wubben et al. (2012) utilize MOSES directly as a method for text
simplification.

Recently Filippova and Altun (2013) developed a strictly extractive
system that is trained on a relatively large corpus (250K sentences) of
article-title pairs. Because their focus is extractive compression, the
sentences are transformed by a series of heuristics such that the words
are in monotonic alignment. Our system does not require this alignment
step but instead uses the text directly.

Neural MT. This work is closely related to recent work on neural network
language models (NNLM) and to work on neural machine translation. The
core of our model is an NNLM based on that of Bengio et al. (2003).

Recently, there have been several papers about models for machine
translation (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever
et al., 2014). Of these our model is most closely related to the
attention-based model of Bahdanau et al. (2014), which explicitly finds
a soft alignment between the current position and the input source. Most
of these models utilize recurrent neural networks (RNNs) for generation
as opposed to feed-forward models. We hope to incorporate an RNN-LM in
future work.

7 Experimental Setup 

We experiment with our attention-based sentence summarization model on
the task of headline generation. In this section we describe the corpora
used for this task, the baseline methods we compare with, and
implementation details of our approach.

7.1 Data Set

The standard sentence summarization evaluation set is associated with
the DUC-2003 and DUC-2004 shared tasks (Over et al., 2007). The data for
this task consists of 500 news articles from the New York Times and
Associated Press Wire services, each paired with 4 different
human-generated reference summaries (not actually headlines), capped at
75 bytes. This data set is evaluation-only, although the similarly sized
DUC-2003 data set was made available for the task. The expectation is
for a summary of roughly 14 words, based on the text of a complete
article (although we only make use of the first sentence). The full data
set is available by request at http://duc.nist.gov/data.html.

For this shared task, systems were entered and evaluated using several
variants of the recall-oriented ROUGE metric (Lin, 2004). To make
recall-only evaluation unbiased to length, the output of all systems is
cut off after 75 characters and no bonus is given for shorter summaries.
Unlike BLEU, which interpolates various n-gram matches, there are
several versions of ROUGE for different match lengths. The DUC
evaluation uses ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L
(longest common subsequence), all of which we report.
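
For intuition about the recall-oriented metric, the sketch below computes a simplified ROUGE-N recall against a single reference after truncating the candidate to a byte budget. It is not the official ROUGE toolkit: stemming, multiple references, confidence intervals, and ROUGE-L are all omitted, whitespace tokenization is assumed, and the function names are hypothetical. Truncating by bytes may also cut the last word in half, which this sketch does not correct.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1, max_bytes=75):
    """Fraction of reference n-grams that also appear in the truncated candidate."""
    truncated = candidate.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
    cand = ngrams(truncated.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

Because the denominator counts reference n-grams only, a longer candidate can never score lower than its own prefix, which is why a fixed length cap is needed for a fair comparison.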

In addition to the standard DUC-2004 evaluation, we also report
evaluation on single-reference headline generation using a randomly
held-out subset of Gigaword. This evaluation is closer to the task the
model is trained for, and it allows us to use a bigger evaluation set,
which we will include in our code release. For this evaluation, we tune
systems to generate output of the average title length.

For training data for both tasks, we utilize the annotated Gigaword data
set (Graff et al., 2003; Napoles et al., 2012), which consists of
standard Gigaword, preprocessed with Stanford CoreNLP tools (Manning et
al., 2014). Our model only uses annotations for tokenization and
sentence separation, although several of the baselines use parsing and
tagging as well. Gigaword contains around 9.5 million news articles
sourced from various domestic and international news services over the
last two decades.

For our training set, we pair the headline of each article with its
first sentence to create an input-summary pair. While the model could in
theory be trained on any pair, Gigaword contains many spurious
headline-article pairs. We therefore prune training based on the
following heuristic filters: (1) Are there no non-stop-words in common?
(2) Does the title contain a byline or other extraneous editing marks?
(3) Does the title have a question mark or colon? After applying these
filters, the training set consists of roughly $J = 4$ million
title-article pairs. We apply a minimal preprocessing step using PTB
tokenization, lower-casing, replacing all digit characters with #, and
replacing word types seen fewer than 5 times with UNK. We also remove
all articles from the time period of the DUC evaluation release.
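
The preprocessing and pruning steps can be sketched roughly as follows, assuming the text is already tokenized (PTB tokenization itself is not reproduced). All function names and the stop-word set are hypothetical, and filter (2), byline detection, is left out because the paper does not specify how it is done.

```python
import re
from collections import Counter


def normalize(tokens):
    """Lower-case each token and replace every digit character with '#'."""
    return [re.sub(r"\d", "#", t.lower()) for t in tokens]


def replace_rare(sentences, min_count=5, unk="UNK"):
    """Replace word types seen fewer than min_count times with the UNK token."""
    counts = Counter(t for s in sentences for t in s)
    return [[t if counts[t] >= min_count else unk for t in s] for s in sentences]


def keep_pair(title, first_sentence, stop_words):
    """Apply filters (1) and (3): require a shared non-stop-word and
    reject titles containing a question mark or colon."""
    def content(tokens):
        return {t for t in tokens if t not in stop_words}

    if not (content(title) & content(first_sentence)):
        return False
    if "?" in title or ":" in title:
        return False
    return True
```

In a full pipeline the counts for `replace_rare` would be collected over the whole training set, separately for the input side and the headline side if the two vocabularies are kept distinct.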

The complete input training vocabulary consists of 119 million word
tokens and 110K unique word types with an average sentence size of 31.3
words. The headline vocabulary consists of 31 million tokens and 69K
word types with the average title of length 8.3 words (note that this is
significantly shorter than the DUC summaries). On average there are 4.6
overlapping word types between the headline and the input, although only
2.6 in the first 75 characters of the input.

7.2 Baselines 

Due to the variety of approaches to the sentence summarization problem,
we report a broad set of headline-generation baselines.

From the DUC-2004 task we include the Prefix baseline that simply
returns the first 75 characters of the input as the headline. We also
report the winning system on this shared task, Topiary (Zajic et al.,
2004). Topiary merges a compression system using linguistically-motivated
transformations of the input (Dorr et al., 2003) with an unsupervised
topic detection (UTD) algorithm that appends key phrases from the full
article onto the compressed output. Woodsend et al. (2010) (described
above) also report results on the DUC dataset.

The DUC task also includes a set of manual summaries performed by 8
human summarizers each summarizing half of the test data sentences
(yielding 4 references per sentence). We report the average
inter-annotator agreement score as Reference. For reference, the best
human evaluator scores 31.7 ROUGE-1.

We also include several baselines that have access to the same training
data as our system. The first is a sentence compression baseline
COMPRESS (Clarke and Lapata, 2008). This model uses the syntactic
structure of the original sentence along with a language model trained
on the headline data to produce a compressed output. The syntax and
language model are combined with a set of linguistic constraints and
decoding is performed with an ILP solver.

To control for memorizing titles from training, we implement an
information retrieval baseline, IR. This baseline indexes the training
set, and gives the title for the article with highest BM-25 match to the
input (see Manning et al. (2008)).

Finally, we use a phrase-based statistical machine translation system
trained on Gigaword to produce summaries, MOSES+ (Koehn et al., 2007).
To improve the baseline for this task, we augment the phrase table with
"deletion" rules mapping each article word to $\epsilon$, include an
additional deletion feature for these rules, and allow for an infinite
distortion limit. We also explicitly tune the model using MERT to target
the 75-byte capped ROUGE score as opposed to standard BLEU-based tuning.
Unfortunately, one remaining issue is that it is non-trivial to modify
the translation decoder to produce fixed-length outputs, so we tune the
system to produce roughly the expected length.



                       DUC-2004                      Gigaword
Model       ROUGE-1  ROUGE-2  ROUGE-L    ROUGE-1  ROUGE-2  ROUGE-L   Ext. %
IR            11.06     1.67     9.67      16.91     5.55    15.58     29.2
Prefix        22.43     6.49    19.65      23.14     8.25    21.73    100
Compress      19.77     4.02    17.30      19.63     5.13    18.28    100
W&L           22        6       17          -        -        -        -
Topiary       25.12     6.46    20.12       -        -        -        -
MOSES+        26.50     8.13    22.85      28.77    12.10    26.44     70.5
ABS           26.55     7.06    22.05      30.88    12.22    27.77     85.4
ABS+          28.18     8.49    23.81      31.00    12.65    28.34     91.5
Reference     29.21     8.38    24.46       -        -        -       45.6

Table 1: Experimental results on the main summary tasks on various ROUGE
metrics. Baseline models are described in detail in Section 7.2. We
report the percentage of tokens in the summary that also appear in the
input for Gigaword as Ext. %.


7.3 Implementation 

For training, we use mini-batch stochastic gradient descent to minimize
negative log-likelihood. We use a learning rate of 0.05, and halve the
learning rate if validation log-likelihood does not improve for an
epoch. Training is performed with shuffled mini-batches of size 64. The
mini-batches are grouped by input length. After each epoch, we
renormalize the embedding tables (Hinton et al., 2012). Based on the
validation set, we set hyperparameters as $D = 200$, $H = 400$, $C = 5$,
$L = 3$, and $Q = 2$.
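
The learning-rate rule can be illustrated with a few lines of code. This is one plausible reading rather than the authors' exact procedure: "does not improve" is interpreted here as failing to beat the best validation log-likelihood seen so far, and the function name is hypothetical.

```python
def learning_rate_schedule(val_ll_per_epoch, initial_lr=0.05):
    """Return the learning rate in effect after each epoch.

    The rate is halved whenever the epoch's validation log-likelihood
    does not exceed the best value seen so far (an assumption).
    """
    lr, best, rates = initial_lr, float("-inf"), []
    for ll in val_ll_per_epoch:
        if ll > best:
            best = ll
        else:
            lr /= 2.0
        rates.append(lr)
    return rates
```

For example, validation log-likelihoods of -5.0, -4.0, -4.2, -3.9, -3.9 would trigger two halvings, at the third and fifth epochs.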

Our implementation uses the Torch numerical framework
(http://torch.ch/) and will be openly available along with the data
pipeline. Crucially, training is performed on GPUs and would be
intractable or require approximations otherwise. Processing 1000
mini-batches with $D = 200$, $H = 400$ requires 160 seconds. Best
validation accuracy is reached after 15 epochs through the data, which
requires around 4 days of training.

Additionally, as described in Section 5 we apply a MERT tuning step
after training using the DUC-2003 data. For this step we use Z-MERT
(Zaidan, 2009). We refer to the main model as ABS and the tuned model as
ABS+.

8 Results 

Our main results are presented in Table 1. We run experiments both using
the DUC-2004 evaluation data set (500 sentences, 4 references, 75 bytes)
with all systems and a randomly held-out Gigaword test set (2000
sentences, 1 reference). We first note that the baselines COMPRESS and
IR do relatively poorly on both datasets, indicating that neither
article information nor language model information alone is sufficient
for the task. The Prefix baseline actually performs surprisingly well on
ROUGE-1, which makes sense given the earlier observed overlap between
article and summary.

Both ABS and MOSES+ perform better than Topiary, particularly on ROUGE-2
and ROUGE-L in DUC. The full model ABS+ scores the best on these tasks,
and is significantly better based on the default ROUGE confidence level
than Topiary on all metrics, and MOSES+ on ROUGE-1 for DUC as well as
ROUGE-1 and ROUGE-L for Gigaword. Note that the additional extractive
features bias the system towards retaining more input words, which is
useful for the underlying metric.

Next we consider ablations to the model and algorithm structure. Table 2
shows experiments for the model with various encoders. For these
experiments we look at the perplexity of the system as a language model
on validation data, which controls for the variable of inference and
tuning. The NNLM language model with no encoder gives a gain over the
standard n-gram language model (145.9 versus 183.2). Including even the
bag-of-words encoder reduces perplexity to below 50. Both the
convolutional encoder and the attention-based encoder further reduce the
perplexity, with attention giving a value below 30.
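
As a reminder of how perplexity relates to the training objective, the sketch below converts per-token log-probabilities into perplexity. The function name is hypothetical and natural logarithms are assumed; a model reporting base-2 log-probabilities would need the matching base in the exponent.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A model that assigns every token probability 1/4 has perplexity 4, so a drop from roughly 146 to 27 means the encoder makes the next summary word far less uncertain on average.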

We also consider model and decoding ablations on the main summary model,
shown in Table 3. These experiments compare to the BoW encoding models,
compare beam search and greedy decoding, as well as restricting the
system to be completely extractive. Of these features, the biggest
impact is from using a more powerful encoder (attention versus BoW), as
well as using beam search to generate summaries. The abstractive nature
of the system helps, but for ROUGE even using pure extractive generation
is effective.



Model                    Encoder    Perplexity
KN-Smoothed 5-Gram       none            183.2
Feed-Forward NNLM        none            145.9
Bag-of-Word              enc_1            43.6
Convolutional (TDNN)     enc_2            35.9
Attention-Based (ABS)    enc_3            27.1

Table 2: Perplexity results on the Gigaword validation set comparing
various language models with C=5 and end-to-end summarization models.
The encoders are defined in Section 3.


Decoder   Model   Cons.     R-1    R-2     R-L
Greedy    ABS+    Abs     26.67   6.72   21.70
Beam      BoW     Abs     22.15   4.60   18.23
Beam      ABS+    Ext     27.89   7.56   22.84
Beam      ABS+    Abs     28.48   8.91   23.97

Table 3: ROUGE scores on DUC-2003 development data for various versions
of inference. Greedy and Beam are described in Section 4. Ext. is a
purely extractive version of the system (Eq. 2).

Finally we consider example summaries shown in Figure 4. Despite
improving on the baseline scores, this model is far from human
performance on this task. Generally the models are good at picking out
key words from the input, such as names and places. However, both models
will reorder words in syntactically incorrect ways, for instance in
Sentence 7 both models have the wrong subject. ABS often uses more
interesting re-wording, for instance new nz pm after election in
Sentence 4, but this can also lead to attachment mistakes such as
russian oil giant chevron in Sentence 11.

9 Conclusion 

We have presented a neural attention-based model for abstractive
summarization, based on recent developments in neural machine
translation. We combine this probabilistic model with a generation
algorithm which produces accurate abstractive summaries. As a next step
we would like to further improve the grammaticality of the summaries in
a data-driven way, as well as scale this system to generate
paragraph-level summaries. Both pose additional challenges in terms of
efficient alignment and consistency in generation.